ABSTRACT


The Defense Department has developed a computerized adaptive testing
(CAT) version of the Armed Services Vocational Aptitude Battery
(ASVAB). During the Mechanical Maintenance phase of the Marine Corps
Job Performance Measurement project, CAT-ASVAB was administered to
over 1,400 Marines in Automotive Repair and Helicopter Repair
occupations. The scores of these Marines were analyzed to assess the
reliability of CAT-ASVAB, the potential effect of test item
compromise, and how the use of computers has affected the nature of
speed tests.

This research memorandum presents the results of the analysis.

EXECUTIVE SUMMARY


The Defense Department uses the Armed Services Vocational Aptitude
Battery (ASVAB) for selection and classification of enlisted personnel.
A computerized adaptive testing (CAT) version of the ASVAB has been
developed by the Navy Personnel Research and Development Center. As
part of the Marine Corps Job Performance Measurement (JPM) project,
CAT-ASVAB was administered to over 1,400 Marines in ground repair and
helicopter repair occupations. Over 200 of the Marines took CAT-ASVAB a
second time after a week or two. This research memorandum analyzes
these data to address some issues related to the operational use of
CAT-ASVAB.

ISSUES 

A CAT-ASVAB subtest contains fewer questions than the standard
paper-and-pencil (PP) version of that subtest. Although scores on
shorter tests tend to be less precise, CAT questions are chosen to
maximize the measurement precision of the test score for the person
being tested, which counteracts the effect of the test being shorter.
It is important to determine whether CAT-ASVAB provides as much
measurement precision as the PP version.

For the ASVAB to provide meaningful scores, examinees should have
no prior knowledge of questions on the test. A recruiter who has come
to know some ASVAB questions and tells an applicant what they are could
potentially increase that applicant's score. To reduce the impact of
potential compromise, an applicant is given a randomly chosen form from
six operational forms of PP ASVAB. CAT has only two forms at present
but, because of its adaptive nature, no two CAT tests use exactly the
same set of questions. Yet, some prior knowledge of questions may help
an examinee score higher. No empirical study of this issue is
available.

The ASVAB contains two tests of speed: Numerical Operations (NO)
and Coding Speed (CS). NO measures the rate of performing simple
arithmetic operations. CS requires rapid recognition of words and
numbers. Use of a computer in CAT-ASVAB has improved the accuracy of
these tests because the computer can measure the time spent on
individual questions. Thus, it is possible that computerized
administration has changed the nature of the aptitude measured by NO
and CS.

RESULTS 

Based on results of the analysis in this report, it is evident that
CAT-ASVAB provides more precise measurement of aptitudes than does the
PP version of the test. The increase in the predictive power of the
selection and classification composites is no more than 1 percent,
however.

If an examinee takes CAT-ASVAB twice, scores on the second test
tend to be higher if the same form of CAT is used on both the first and
second tests, indicating that prior knowledge of some questions does
help raise scores. Hence, it cannot be assumed that CAT provides
protection against compromise of test questions.

Finally, the aptitude measured by Coding Speed is the same in CAT
and PP versions of the tests. The CAT version of NO measures not only
speed but mathematical aptitude as well. Although computerized adaptive
testing provides more accurate measurement of aptitude, the practical
value of this improvement is unclear.

INTRODUCTION


The Armed Services Vocational Aptitude Battery (ASVAB) is used for
selection and classification of enlisted personnel. It contains ten
subtests: General Science (GS), Arithmetic Reasoning (AR), Word
Knowledge (WK), Paragraph Comprehension (PC), Numerical Operations (NO),
Coding Speed (CS), Auto and Shop Information (AS), Mathematics Knowledge
(MK), Mechanical Comprehension (MC), and Electronics Information (EI).
The Verbal (VE) raw score is defined as the sum of WK and PC scores.
Subtests NO and CS are tests of speed in handling numerical and symbolic
material. All others are power tests with liberal time limits.
Standard scores rather than raw scores on the subtests are used in all
decisions based on the ASVAB. Standard scores are integers from 20 to
80, with mean 50 and standard deviation 10 in the 1980 reference
population [1].

Standard scores on subtests are combined into the Armed Forces
Qualification Test (AFQT) score, which is the same for all services, and
into occupational composites, which vary from one service to another.
The AFQT is the primary score for selection. It contains subtests VE
(with double weight), AR, and MK. The AFQT score may be expressed as a
sum of standard scores or as a percentile rank; the former is more
convenient for statistical analysis whereas the latter is the score used
for selection. Composite scores are used to classify a recruit into a
military occupational specialty (MOS). The Marine Corps uses four
composites: (1) Mechanical Maintenance (MM), containing AR, AS, MC, and
EI; (2) Clerical (CL), containing VE, MK, and CS; (3) Electronics (EL),
containing GS, AR, MK, and EI; and (4) General Technical (GT),
containing VE, AR, and MC. Scores on these composites have mean 100 and
standard deviation 20 in the reference population.

In the computerized adaptive testing (CAT) version of the ASVAB,
there is a large pool of items for each subtest. Different items are
administered to different examinees in an attempt to maximize
measurement precision for each examinee. The adaptive strategy of item
selection consists of using the current estimate of the examinee's
ability to select the next item. The next item is chosen to maximize
the information it will provide about the examinee's ability. As a
result of this procedure, a CAT subtest of any given length is more
reliable than a PP version containing the same number of items. Using
the results obtained in equating studies, CAT scores are expressed as
raw number-correct scores equivalent to those on PP Form 8a. These are
converted to standard scores using the same table as the one used with
Form 8a. Tables for converting CAT ability scores to raw scores were
provided by the Navy Personnel Research and Development Center.
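
The item selection rule described above can be illustrated with a short
sketch. The memorandum does not state which item response model
CAT-ASVAB uses; the three-parameter logistic model, the scaling constant
1.7, and the function names below are assumptions made only for this
illustration.

```python
import math

def p_correct(theta, a, b, c):
    """Probability of a correct answer under an assumed 3PL model.

    a = discrimination, b = difficulty, c = guessing parameter.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of one 3PL item at ability theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (1.7 * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_next_item(theta, pool, administered):
    """Index of the unused item with the most information at theta.

    pool is a list of (a, b, c) tuples; administered is a set of
    indices already given to this examinee.
    """
    candidates = [i for i in range(len(pool)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, *pool[i]))
```

After each response the ability estimate would be updated and the rule
applied again, so two examinees with different response patterns see
different items from the same pool.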

The Marine Corps is conducting a multiphase Job Performance
Measurement (JPM) project. In the Mechanical Maintenance (MM) phase of
this project, CAT-ASVAB was administered to over 1,400 Marines in
Automotive Repair and Helicopter Repair occupations. If a Marine's CAT
scores were higher than his or her previous ASVAB scores, the CAT scores
became the scores of record. The chance to improve the scores of record
motivated the Marines to perform well on CAT. The CAT-ASVAB was
readministered to about 200 randomly selected Marines a week or two
later. This subsample yields reliabilities of CAT-ASVAB subtests and
composites. The total sample is useful for other analyses. Data on
each Marine included ASVAB standard scores on the test taken for
enlistment. Because the number of women in the study was small, and
because there is evidence that the CAT version tends to underestimate
their aptitude in Auto and Shop Information, women were excluded from
all analyses.

RELIABILITY  OF  CAT-ASVAB  SUBTESTS 

Each  CAT-ASVAB  subtest  contains  fewer  items  than  its  PP  version. 
Shorter  tests  provide  less  reliable  measures  of  performance  than  longer 
ones,  other  things  being  equal.  On  the  other  hand,  the  adaptive  nature 
of  CAT  makes  it  more  reliable  than  a  PP  test  of  the  same  length,  if  the 
average  item  in  the  CAT  pool  is  as  good  as  the  average  item  in  the  PP 
version.  The  empirical  question  is:  given  the  test  lengths  and  item 
pools  of  CAT-ASVAB,  how  do  the  reliabilities  of  its  subtests  and 
composites  compare  with  those  in  the  PP  version? 

The sample available for reliability analyses consisted of 202
Marines who were administered the CAT test twice. Because of policy
constraints, scores from the second test could not be used to improve an
individual's scores of record. Therefore, some examinees may have been
less motivated on the second test, so some data editing was needed to
remove possibly unmotivated examinees. In each administration, scores
on each power subtest were standardized to mean 50 and standard
deviation 10 in the sample. Denote these by Z_{s1} and Z_{s2}, where
subscript s indicates the subtest and 1, 2 represent first and second
administrations of the test. For each Marine, two indices were created
from these standardized scores. The first index, based on the total
score on all power subtests, was given by

    Q_1 = \sum_s Z_{s1} - \sum_s Z_{s2} .

The second index was

    Q_2 = \sum_s (Z_{s1} - Z_{s2})^2 ,

where each sum was taken over all eight power subtests. Distributions
of both indices were examined, and cutoffs set near the 95th
percentile. The cutoff was 90 for Q_1 and 210 for Q_2. Marines were
retained for analysis if both indices were below their cutoffs, which
led to a useful sample of 188 persons.
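
A minimal sketch of this screening step follows. It takes the first
index as the difference between total standardized scores on the two
administrations and the second as the sum of squared subtest
differences; that reading of the indices, the dictionary layout, and
the function names are assumptions made for illustration.

```python
POWER_SUBTESTS = ("GS", "AR", "WK", "PC", "AS", "MK", "MC", "EI")

def screening_indices(first, second):
    """Return (Q1, Q2) for one examinee.

    first and second map each power subtest to the standardized score
    (mean 50, SD 10 in the sample) on the first and second test.
    """
    q1 = sum(first[s] for s in POWER_SUBTESTS) - sum(second[s] for s in POWER_SUBTESTS)
    q2 = sum((first[s] - second[s]) ** 2 for s in POWER_SUBTESTS)
    return q1, q2

def keep_examinee(first, second, q1_cutoff=90.0, q2_cutoff=210.0):
    """Retain the examinee only if both indices are below their cutoffs."""
    q1, q2 = screening_indices(first, second)
    return q1 < q1_cutoff and q2 < q2_cutoff
```

The default cutoffs are the values reported in the text; an examinee
whose retest scores are identical to the initial scores has both
indices equal to zero and is retained.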
There are two equivalent forms of CAT-ASVAB. The CAT form used on
the retest was chosen at random, without attention to which form was
used in the original test. As a result, 91 Marines got the same form on
retest as on the initial one, and 97 got a different form. Only the
latter subsample was used to calculate reliabilities.

An estimate of reliability in the recruit population is given by
the correlation of a subtest from initial test to retest on a different
form. These reliabilities have little operational significance because
recruits have already been selected on the basis of their scores on the
enlistment ASVAB. To decide how well CAT will work in selection and
classification, reliabilities in the unrestricted national population
are needed. The conversion of statistics from recruits to the national
population is termed "correction for range restriction." To make such a
correction, it is assumed that the error variance of any score, on a
subtest or a composite, is the same in both populations. Variances of
standard scores in the national population are known. Therefore, it is
easy to compute corrected reliabilities. Take GS as an example. Its
variances in the recruit sample and the national population are 35.0 and
100; its reliability in the sample is 0.739. Hence, its error variance
is EVAR(GS) = 35.0 (1 - 0.739) = 9.14, and its corrected reliability is
1 - 9.14/100 = 0.909. The same method applies to composites as well,
except that the error variance of a composite is the sum of error
variances of subtests in the composite, each multiplied by the square
of its weight. Take AFQT as an example. The AFQT sum of standard
scores is AFQT = 2 VE + AR + MK, and hence, its error variance is

    EVAR(AFQT) = 4 EVAR(VE) + EVAR(AR) + EVAR(MK)
               = 4 (2.6) + 8.3 + 9.2 = 27.9.

Variance of AFQT in the national population is 1,321.2, and the
corrected reliability is 1 - 27.9/1,321.2 = 0.979.
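
The correction can be written as a few lines of code. The sketch below
reproduces the GS and AFQT examples from the figures quoted in the
text; the function names are illustrative only, and the composite
formula assumes, as the text does, that subtest errors are uncorrelated.

```python
def error_variance(sample_variance, sample_reliability):
    """Error variance of a score, estimated in the restricted sample."""
    return sample_variance * (1.0 - sample_reliability)

def corrected_reliability(error_var, population_variance):
    """Reliability in the unrestricted population, assuming the error
    variance is the same in both populations."""
    return 1.0 - error_var / population_variance

def composite_error_variance(weights, subtest_error_vars):
    """Error variance of a weighted sum of subtests with uncorrelated
    errors: each subtest error variance times its squared weight."""
    return sum(w ** 2 * subtest_error_vars[name] for name, w in weights.items())

# GS: sample variance 35.0, sample reliability 0.739, population variance 100
evar_gs = error_variance(35.0, 0.739)
rel_gs = corrected_reliability(evar_gs, 100.0)

# AFQT = 2 VE + AR + MK, population variance 1,321.2
evar_afqt = composite_error_variance(
    {"VE": 2, "AR": 1, "MK": 1},
    {"VE": 2.6, "AR": 8.3, "MK": 9.2},
)
rel_afqt = corrected_reliability(evar_afqt, 1321.2)
```

Rounded to three decimals, rel_gs and rel_afqt agree with the values
0.909 and 0.979 given above.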

Reliabilities of CAT-ASVAB scores were compared with those of PP
scores. Alternate form reliabilities of PP subtests are available in
table 6 of the technical supplement to the counselor's manual for the
Student Testing Program [2]. Composite reliabilities were computed from
these subtest reliabilities (as was done for CAT).

Table 1 contains the sample standard deviations (on the initial
test), sample reliabilities, error variances, corrected reliabilities of
CAT subtests and composites, and corrected reliabilities in the PP
version. Sample standard deviations and reliabilities are not reported
for composites because the range-corrected reliabilities were computed
directly from subtest error variances. The largest fractional increase
in composite reliability is 2.6 percent for MM.

Assuming that CAT and PP scores measure the same trait, predictive
validity is proportional to the square root of reliability. Hence, the
percent increase in validity is about half that in reliability.
Therefore, according to composite reliabilities in table 1, an increase
in reliability raises validity in the population by no more than the
1.3 percent increase for MM. The small size of this gain is caused by
the fact that composite reliabilities of PP ASVAB are already close to
the maximum possible value of 1.
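
As a check on this arithmetic, the expected validity gain can be
computed directly from the corrected composite reliabilities reported
in table 1. The sketch assumes, as the text does, that validity is
proportional to the square root of reliability; the function name is
illustrative.

```python
import math

def validity_gain_percent(rel_cat, rel_pp):
    """Percent increase in validity implied by a change in reliability,
    assuming validity is proportional to sqrt(reliability)."""
    return (math.sqrt(rel_cat / rel_pp) - 1.0) * 100.0

# Corrected composite reliabilities from table 1: (CAT, PP)
composites = {
    "AFQT": (0.979, 0.961),
    "MM": (0.972, 0.948),
    "CL": (0.965, 0.945),
    "EL": (0.971, 0.950),
    "GT": (0.967, 0.950),
}

gains = {name: validity_gain_percent(cat, pp) for name, (cat, pp) in composites.items()}
largest = max(gains, key=gains.get)
```

With these inputs the largest gain is for MM and rounds to 1.3 percent;
the gains for the other composites are smaller.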


Table 1.  Sample and corrected statistics for CAT scores and corrected
reliabilities of PP scores

                           CAT-ASVAB                       PP ASVAB

         Sample
         standard     Sample       Error     Corrected     Corrected
Score    deviation  reliability   variance  reliability   reliability

GS          5.9         .739          9.1        .909          .830
AR          6.0         .770          8.3        .917          .900
WK          3.8         .773          3.4        .966          .920
PC          5.9         .713          9.8        .902          .710
NO          6.6         .794          9.0        .910          .820
CS          6.7         .757         10.8        .892          .840
AS          4.7         .833          3.6        .964          .860
MK          7.2         .821          9.2        .908          .870
MC          6.5         .720         12.0        .880          .820
EI          5.7         .697          9.9        .901          .780
VE          4.0         .837          2.6        .974          .930

AFQT                                 27.9        .979          .961
MM                                   33.8        .972          .948
CL                                   22.6        .965          .945
EL                                   36.5        .971          .950
GT                                   22.9        .967          .950

Based on these results, the following summary statements can be
made:

o  CAT scores are more reliable than PP scores.

o  The ratio of CAT to PP reliability is higher for subtests
than for AFQT and the composites. In the latter group,
the largest percent increase from PP to CAT reliability,
observed for MM, is 2.5 percent.

o  Composite validities can be expected to increase by only
about 1 percent if the PP version is replaced by CAT.

SUSCEPTIBILITY TO COMPROMISE


As mentioned above, 91 Marines were tested with the same form on
the initial and second tests. The results presented below indicate that
some items used in the initial test were repeated on the retest. If the
examinees tended to answer these questions the same way on the retest as
on the initial test, this commonality makes the test-retest correlation
higher than the correlation that appears when different forms were used
on the test and the retest. Such a difference in test-retest
correlations was found; the average difference over the 11 subtests was
0.08. For this reason, Marines who had the same CAT form on both tests
were excluded from reliability analyses.

If some or all of the Marines thought about the test questions and
learned the correct answers to some items they had answered wrong on the
first test, retest scores would increase. Mean scores were analyzed to
determine if this phenomenon had occurred. To simplify the analysis,
standard scores on power tests (except VE) were added up and only this
sum was analyzed. The sum on the initial test was subtracted from the
retest. The mean change was -4.60 points when initial and retest forms
were different; that is, the mean score went down, which is consistent
with the fact that, on retest, there was no incentive to score high.
The mean change was 1.13 points when the forms were the same. The
difference between the means is statistically significant at the 0.01
level. Thus, Marines tended to score higher on the retest when the form
was the same as on the initial test. This shows the effect of prior
exposure to some of the items, even in the absence of any coaching.
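
The comparison of mean changes between the two groups can be sketched
as a two-sample test. The memorandum does not say which significance
test was used; Welch's unequal-variance t statistic is shown here as
one reasonable choice, and the function name and argument names are
hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(changes_same_form, changes_different_form):
    """Welch t statistic for the difference in mean score change
    (retest minus initial) between two independent groups."""
    n_same = len(changes_same_form)
    n_diff = len(changes_different_form)
    # Standard error of the difference between the two group means
    se = math.sqrt(
        variance(changes_same_form) / n_same
        + variance(changes_different_form) / n_diff
    )
    return (mean(changes_same_form) - mean(changes_different_form)) / se
```

Each argument is a list of per-examinee changes in the summed power
test score. A large positive value indicates that the same-form group
gained more (or lost less) on retest than the different-form group.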

NATURE  OF  SPEED  TESTS 

As seen in table 1, CAT versions of speed tests are more reliable
than PP versions. The next question is whether the CAT versions measure
anything other than speed. All 1,434 Marines were included in the
analyses of this question, and only the initial test scores were used.

The ASVAB measures four factors: speed, verbal, mathematical, and
technical abilities [3]. Scores on the three non-speed factors, using
standard scores, were defined as follows: VERB = GS + 2 VE,
MATH = AR + MK, and TECH = AS + MC + EI. For each Marine, available
scores included PP scores from the enlistment ASVAB as well as the CAT
scores.

CAT NO was regressed on CAT CS, PP NO, and PP CS. Similarly, the
other speed tests were regressed on the remaining speed scores. Then,
in all regressions, the factor scores VERB, MATH, and TECH, in the same
battery as the dependent variable, were added to the list of
predictors. Table 2 presents the sum of squares explained by each
predictor when it is added to the equation. (It should be remembered
that the order of entering speed tests into the regression varies from
one dependent variable to another.) The squared multiple correlation
(adjusted to make it an unbiased estimate) is reported for regression on
speed tests only, and after the other factors have been added to the
equation.


Table 2.  Explained sums of squares and squared multiple correlations
in predicting scores on speed tests

                             Dependent variable

                          CAT                   PP

                     NO         CS         NO         CS

SS(CAT NO)                  13,369      6,453          4
SS(CAT CS)       14,399                     0      5,399
SS(PP NO)         8,277          0                14,293
SS(PP CS)             4      5,835     13,352
RSQ               0.338      0.308      0.354      0.329

SS(VERB)            970        581        381        110
SS(MATH)          4,120        313         66         11
SS(TECH)            873         14        146        196
RSQ               0.427      0.322      0.363      0.333

First, consider the simpler case of CS. After regressing on three
speed subtests, the squared multiple correlation, adjusted for
capitalizing on chance, was 0.308. After adding the three factor
scores, the squared multiple correlation increased to only 0.322. Thus,
the CAT version of Coding Speed appears to measure little other than
speed of symbol recognition. For NO, the adjusted R-square using only
the speed subtests was 0.338. After adding the other factors, this
increased to 0.427. The MATH factor was responsible for 69 percent of
the additional variance explained. Thus, the CAT version of Numerical
Operations measures speed and, to a smaller extent, mathematical
aptitude. Like CAT CS, the PP versions are also almost pure measures of
speed.
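
The 69 percent figure can be verified from the factor rows of table 2.
The short sketch below computes, for each dependent variable, the share
of the additional explained sum of squares attributable to each factor;
the dictionary layout and function name are illustrative.

```python
# Sums of squares added by each factor score (table 2, lower panel)
added_ss = {
    "CAT NO": {"VERB": 970, "MATH": 4120, "TECH": 873},
    "CAT CS": {"VERB": 581, "MATH": 313, "TECH": 14},
    "PP NO": {"VERB": 381, "MATH": 66, "TECH": 146},
    "PP CS": {"VERB": 110, "MATH": 11, "TECH": 196},
}

def factor_share(dependent, factor):
    """Fraction of the additional explained sum of squares for one
    dependent variable that is due to one factor score."""
    ss = added_ss[dependent]
    return ss[factor] / sum(ss.values())

math_share_cat_no = factor_share("CAT NO", "MATH")  # 4,120 / 5,963
```

For CAT NO the MATH share is 4,120 out of 5,963, or about 69 percent,
in agreement with the text.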

SUMMARY 

This report addresses three issues concerning the computerized
adaptive version of the ASVAB: reliability, the potential effect of
item compromise, and the nature of speed subtests. Based on the
analysis, CAT-ASVAB is more reliable than the PP version. For composite
scores, which are the level at which selection and classification
decisions are made, however, the increase is only about 2 percentage
points. This reflects an increase in composite validity of about
1 percent. In translating increased reliability into increased
validity, it is assumed that the two versions measure exactly the same
trait. The analysis indicates that this assumption does not hold for
Numerical Operations. Any departures from the assumption can yield
actual validity above or below the calculated value. At present, there
is too much uncertainty in estimates of reliabilities and validities to
make any strong statement. Based on the available evidence, it appears
that an increase in predictive validity, resulting from higher
reliability, is slight.

The CAT-ASVAB cost benefit analysis [4] claims that the CAT version
provides greater protection against test compromise because it "makes
item specific training of examinees, or the knowledge of individual test
items in advance, essentially useless" (p. E-4). Such is not the case.
CNA's analysis of mean scores shows that retest scores were higher on
average when the form was the same as that used on the initial test.
Thus, despite the absence of coaching or even of any pressing reason to
score high, prior knowledge of some items tended to raise scores.

The last analysis showed that the CAT version of Numerical
Operations measures not only speed in making simple calculations, but
mathematical aptitude as well. This may be due to the fact that, in the
CAT version, questions are answered by pressing keys, which takes less
time than filling spaces on multiple choice answer sheets. Thus, more
of the time spent on an item is actually used for the numerical
operation involved. If the trait measured by the CAT version of a
subtest differs from that measured by the PP version, standard formulas
that relate predictive validity to reliability cannot be used.